Doublets are prevalent in single-cell sequencing data and can lead to artifactual findings.
A number of strategies have therefore been proposed to detect them.
Building on the strengths of existing approaches, we developed scDblFinder,
a fast, flexible and accurate Bioconductor-based doublet detection method.
Here we present the method, justify its design choices, demonstrate its performance on both single-cell RNAseq and single-cell ATACseq data, and provide some observations on doublet formation, detection, and enrichment analysis.
Even in complex datasets, scDblFinder can accurately identify most heterotypic doublets, and an independent benchmark already found it to outperform alternatives.
High-throughput single-cell sequencing, in particular single-cell/nucleus RNA-sequencing (scRNAseq), has provided unprecedented resolution on biological phenomena. A particularly popular approach uses oil droplets or wells to isolate single cells along with barcoded beads. Depending on the cell density loaded, a proportion of reaction volumes (i.e. droplets or wells) will capture more than one cell, forming ‘doublets’ (or ‘multiplets’), i.e. two or more cells captured by a single reaction volume and thus sequenced as a single-cell artifact. The proportion of doublets has been shown to be proportional to the number of cells captured (Bloom 2018; Kang et al. 2018). It is therefore common for present-day single-cell experiments to contain 10-20% doublets, making accurate doublet detection critical.
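As a rough illustration, the expected doublet rate can be approximated as a linear function of the number of cells captured; the slope used here (~1% per thousand cells) is an assumption typical of droplet-based protocols and varies across platforms and loading concentrations.

```r
# Expected doublet rate as a linear function of the number of cells captured.
# The slope (~1% per 1000 cells) is an assumption; it depends on the platform
# and the loading concentration.
expected_doublet_rate <- function(ncells, rate.per.1000 = 0.01) {
  (ncells / 1000) * rate.per.1000
}
expected_doublet_rate(15000)  # ~15% expected doublets at 15k cells
```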
‘Homotypic’ doublets, which are formed by cells of the same type (i.e. similar transcriptional state), are very difficult to identify on the basis of their transcriptome alone (McGinnis, Murrow, and Gartner 2019). They are also, however, relatively innocuous for most purposes, as they appear highly similar to singlets. ‘Heterotypic’ doublets (formed by cells of distinct transcriptional states), instead, can appear as an artifactual novel cell type and disrupt downstream analyses (Germain, Sonrel, and Robinson 2020).
Experimental methods have been devised for detecting doublets in multiplexed samples, using barcodes (Stoeckius et al. 2018) or genotypes (e.g. single-nucleotide polymorphisms) to identify droplets containing material from more than one sample (Kang et al. 2018). While evidently useful, these identify only a fraction of the doublets, and fail to detect doublets formed by cells from the same sample, including heterotypic doublets.
A number of computational approaches have therefore been developed to identify doublets on the basis of their transcriptional profile (McGinnis, Murrow, and Gartner 2019; DePasquale et al. 2019; Wolock, Lopez, and Klein 2019; Bais and Kostka 2020; Bernstein et al. 2020).
Most of these approaches rely on the generation of artificial doublets by summing or averaging real cells, and score the similarity between them and the real cells.
For example, DoubletFinder generates a k-nearest neighbor (kNN) graph on the union of real cells and artificial doublets, and estimates the density of artificial doublets in the neighborhood of each cell (McGinnis, Murrow, and Gartner 2019).
In a similar fashion, one of the methods proposed by Bais and Kostka (2020), bcds, generates artificial doublets and trains a classifier to distinguish them from real cells.
Real cells that the classifier assigns to the artificial doublet class are then called as doublets.
Finally, another strategy proposed by Bais and Kostka (2020) is a coexpression score, cxds, which flags cells that co-express a number of genes that otherwise tend to be mutually exclusive across cells.
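The common core of these approaches, generating artificial doublets by combining real cells, can be sketched as follows (a minimal illustration, not any specific package's implementation):

```r
# Generate n artificial doublets by summing the counts of random cell pairs.
# counts: a genes x cells matrix of raw counts.
make_artificial_doublets <- function(counts, n) {
  i <- sample(ncol(counts), n, replace = TRUE)
  j <- sample(ncol(counts), n, replace = TRUE)
  ad <- counts[, i, drop = FALSE] + counts[, j, drop = FALSE]
  colnames(ad) <- paste0("artificial.", seq_len(n))
  ad
}
```

The resulting artificial doublets can then be embedded alongside the real cells and used for neighborhood-based scoring or classifier training.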
Xi and Li (2021a) recently reported a benchmark of computational doublet detection methods, using both simulations and real datasets with genotype-based true doublets.
Interestingly, despite several new publications, the benchmark identified the oldest method, DoubletFinder (McGinnis, Murrow, and Gartner 2019), as the most accurate.
However, another important observation from the benchmark was that no single method was systematically the best across all datasets, highlighting the necessity to test and benchmark methods across a variety of datasets, and suggesting that some strategies might have advantages and disadvantages across situations.
Here, we present the scDblFinder package, building on the extensive single-cell Bioconductor methods and infrastructures (Amezquita et al. 2019) and implementing a number of doublet detection approaches.
In particular, the scDblFinder method integrates insights from previous approaches and novel improvements to generate fast, flexible and robust doublet prediction. scDblFinder was independently tested by Xi and Li in the protocol extension to their initial benchmark and was found to have the best overall performance (Xi and Li 2021b).
As most approaches rely on some comparison of real cells to artificial doublets, it is crucial to simulate doublets appropriately. To this end, we first characterized real doublets using a dataset of genetically distinct cell lines (Tian et al. 2018). Because each cell line represents a distinct and more or less homogeneous transcriptional state, it is possible to identify the ‘cell types’ composing each doublet (Figure 1). Although doublets often had larger library sizes than singlets, their median library sizes were systematically smaller than the sum of the median library sizes of the composing cell types (Figure 1A). We next investigated the relative contributions of the composing cell types using non-negative least squares regression, expecting the larger cell types to contribute more to the doublet’s transcriptome.
# real doublet sizes compared to expected sizes
sce <- readRDS("../other_datasets/mixology10x5cl.SCE.rds")
md <- colData(sce)
md$class <- as.character(md$phenoid)
md$isDoublet <- md$demuxlet_cls=="DBL"
w <- which(md$isDoublet)
md$class[w] <- as.character(md$demuxlet.dbl.type[w])
ag <- aggregate(md$total_counts, by=list(class=md$class), FUN=median)
row.names(ag) <- ag$class
ag$isDoublet <- grepl("+",ag$class,fixed=TRUE)
ag$mad <- aggregate(md$total_counts, by=list(class=md$class), FUN=mad)[,2]
ag$sum <- sapply(strsplit(ag$class,"+",fixed=TRUE), FUN=function(x) sum(ag[x,"x"]))
p1 <- ggplot(ag, aes(sum, x, colour=isDoublet, label=class)) +
geom_abline(linetype="dashed", colour="grey") + geom_point(size=2) +
geom_segment(aes(x=sum,xend=sum,y=x-mad,yend=x+mad)) +
geom_text_repel(colour="black", min.segment.length=0) +
labs(x="Sum of median library sizes", y="Observed median library size")
# relative contributions of composing cells
dbls <- split(w, md$class[w])
dbls <- dbls[grep("\\+",names(dbls))]
cs <- sapply(split(seq_len(nrow(md))[-w],md$phenoid[-w]), FUN=function(i) rowSums(assay(sce)[,i,drop=FALSE]))
cs <- as.matrix(edgeR::cpm(edgeR::calcNormFactors(edgeR::DGEList(cs))))
dbls <- dplyr::bind_rows(lapply( setNames(names(dbls),names(dbls)), FUN=function(comb){
orig <- strsplit(comb,"+",fixed=TRUE)[[1]]
data.frame(
lsize=colSums(assay(sce)[,dbls[[comb]],drop=FALSE]),
expectedProp=ag[orig[1],"x"]/(ag[orig[1],"x"]+ag[orig[2],"x"]),
prop.type1=apply(as.matrix(assay(sce)[,dbls[[comb]],drop=FALSE]), 2, FUN=function(x){
tryCatch({
mod <- nnls::nnls(cs[,orig],x)
coef(mod)[1]/sum(coef(mod))
}, error=function(e) NA)
}))
}), .id="doublet.type")
saveRDS(dbls, file="data/doublet_contributions.rds")
ag2 <- t(sapply(split(seq_len(nrow(dbls)), dbls$doublet.type), FUN=function(w){
c(ratio=weighted.mean(dbls$prop.type1[w], dbls$lsize[w], na.rm=TRUE),
expected.ratio=dbls$expectedProp[w[1]])
}))
ag2 <- data.frame(doublet.type=row.names(ag2), ag2)
p2 <- ggplot(dbls, aes(expectedProp, prop.type1)) + geom_abline(linetype="dashed", col="grey") +
geom_point(aes(size=lsize, colour=doublet.type)) + scale_color_discrete(guide=FALSE) +
geom_line(data=ag2, aes(expected.ratio, ratio), size=1.5, colour="grey") +
geom_text_repel(data=ag2, aes(expected.ratio, ratio, label=doublet.type), direction="y", nudge_y=c(0.1,-0.1)) +
labs(x="Expected proportion (based on median cell type library size)",
y="Estimated proportion", size="Library size")
plot_grid(p1, p2, labels="AUTO", nrow=2, scale=0.95)
Figure 1: Characterization of real doublets. A: Observed median library sizes (+/- one median absolute deviation) per cell type, plotted against the additive expectation, for single-cell and doublet types in a real dataset. The dashed line indicates the diagonal. B: Relative contribution of composing cell types in real doublets (each point represents a doublet) plotted against the expected relative contributions (based on the ratio between the median library sizes of the composing cell types). Values indicate the relative contribution of one of the two cell types to the doublet’s transcriptome. The dashed line indicates the diagonal, and the thick line indicates the weighted mean per doublet type.
Although differences in median library size across cell types were small (less than two-fold) compared to other datasets, we observed a weak association between the relative contributions and the relative sizes of the composing cell types (Figure 1B, p~2e-10). This effect was however considerably smaller than the variation within doublet types, suggesting i) large variations in real cell size within a given cell type, and/or ii) large variations in mRNA sampling efficiency that are independent for the two composing cells. In light of these ambiguities, we opted for a mixed strategy in the generation of artificial doublets: a proportion is generated by summing the libraries of individual cells, another by performing a Poisson resampling of the summed counts, and a third by re-weighting the contributions of the cells according to the relative median library sizes of the composing cell types.
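These three flavours can be sketched as follows (an illustrative simplification of the strategy described above; `msA` and `msB` stand for the median library sizes of the two composing cell types):

```r
# a, b: raw count vectors of the two cells being combined.
# 1. Plain sum of the two libraries
plain_sum     <- function(a, b) a + b
# 2. Poisson resampling of the summed counts, mimicking independent
#    mRNA sampling of the combined material
pois_resample <- function(a, b) rpois(length(a), lambda = a + b)
# 3. Re-weighting the two contributions by the relative median library
#    sizes (msA, msB) of the composing cell types
weighted_sum  <- function(a, b, msA, msB) {
  w <- msA / (msA + msB)
  round(2 * (w * a + (1 - w) * b))
}
```

With equal median library sizes, the weighted sum reduces to the plain sum; with unequal sizes, the larger cell type contributes proportionally more counts.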
knitr::include_graphics("strategy2.svg")
Figure 2: Overview of the scDblFinder method.
Figure 2 gives an overview of the scDblFinder method.
As a first step, the dataset is reduced to its top most expressed features (1000 by default); if the cluster-based approach is used, the top features per cluster are instead selected.
If using the cluster-based approach (and not manually specifying the clusters), a fast clustering is performed (see Methods).
Artificial doublets are then created by combining cells of different clusters, proportionally to the cluster sizes.
In explicitly concentrating on inter-cluster doublets, we do not attempt to identify homotypic doublets, which are anyway virtually unidentifiable and relatively innocuous.
In doing so, we reduce the necessary number of artificial doublets (since no artificial doublet is ‘lost’ modeling homotypic doublets), and prevent the classifier from being trained to recognize cells that are indistinguishable from singlets (and would therefore call singlets as doublets).
An alternative strategy, also available through scDblFinder, is to generate fully random artificial doublets, and use the iterative procedure (see below) to exclude unidentifiable artificial doublets from the training.
In practice, the two approaches have comparable performances (see below), and they can also be combined.
Dimension reduction is then performed on the union of real cells and artificial doublets, and a nearest neighbor network is generated.
The network is then used to estimate a number of characteristics for each cell, in particular the proportion of artificial doublets among the nearest neighbors.
Rather than selecting a specific neighborhood size, the ratio is calculated at different values of k, creating multiple predictors that will be used by the classifier.
A distance-weighted ratio is also included.
Further cell-level predictors are added, including: projections on principal components; library size; and co-expression scores (based on a variation of Bais and Kostka 2020).
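The multi-k neighborhood predictor can be sketched as follows (a minimal version, assuming a precomputed neighbor index matrix, e.g. from a kNN search, with neighbors ordered from nearest to farthest):

```r
# nn.idx: cells x maxK matrix of neighbor indices (nearest first).
# is.artificial: logical vector over all points in the joint embedding
#   (real cells and artificial doublets), TRUE for artificial doublets.
# Returns, for each cell, the proportion of artificial doublets among its
# k nearest neighbors, for each k in `ks`.
knn_doublet_ratios <- function(nn.idx, is.artificial, ks = c(5, 10, 20)) {
  sapply(ks, function(k)
    rowMeans(matrix(is.artificial[nn.idx[, seq_len(k)]], nrow = nrow(nn.idx))))
}
```

Each column of the result then serves as one predictor for the classifier.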
scDblFinder then trains gradient boosted trees to distinguish, based on these features, artificial doublets from real cells.
Finally, a thresholding procedure decides the score at which to call a cell by simultaneously minimizing the misclassification rate and the expected doublet rate (see Methods and Supplementary Figure 1).
A key problem with classifier-based approaches is that some of the real cells are mislabeled, in the sense that they are in fact doublets labeled as singlets. These can mislead the classifier. For this reason, classification and thresholding are performed in an iterative fashion: at each round, the real cells identified as doublets are removed from the training data for the next round.
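In practice, the whole procedure is run through a single call (a typical invocation; argument and column names follow the package's documentation and may evolve across versions):

```r
library(scDblFinder)
# sce: a SingleCellExperiment with a raw counts assay
sce <- scDblFinder(sce)
# Scores and calls are appended to the colData:
table(sce$scDblFinder.class)    # 'singlet' / 'doublet' calls
summary(sce$scDblFinder.score)  # classifier scores, interpretable as probabilities
# Cluster-based doublet generation, using precomputed cluster labels:
# sce <- scDblFinder(sce, clusters = "cluster_label")
```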
Using the benchmark datasets from Xi and Li (2021a), we next optimized a number of parameters in the procedure, notably regarding features to include and hyperparameters, so as to provide robust default parameters.
Some features, such as the distance to the nearest doublet or whether the nearest neighbor is an artificial doublet, had a negative impact on performance (Supplementary Figure 2), presumably because they led to over-fitting.
Indeed, because artificial doublet creation can only approximate real doublets, a risk of classifier-based approaches is that the exact classification problem on which they are trained, namely distinguishing artificial doublets from real cells, slightly differs from the real problem on which they are expected to function (distinguishing real doublets from real singlets).
To test the hypothesis that this can lead to overfitting, we used scDblFinder without the dimensional reduction and kNN steps, which arguably involve a loss of information, and trained the classifier directly on the expression of the selected genes.
This resulted in a reduction in AUPRC in real datasets (Supplementary Figure 3 and Figure 3).
Finally, in line with a discrepancy between the trained and real problems, we observed that the variable importance calculated during training (Supplementary Figure 4) did not necessarily match that of the variable drop experiments (Supplementary Figure 2).
Lastly, we optimized hyperparameters (Supplementary Figure 5) as well as the number of iterations (Supplementary Figure 6), finding that a relatively low number of iterations (2-3) was sufficient.
A previous version of scDblFinder was already compared to, and shown superior to, existing alternatives in an independent benchmark (Xi and Li 2021a).
Here we reproduced this benchmark using the most recent versions of the packages, and included variant methods from the scDblFinder package (among them an updated version of scran’s original method, now available in the scDblFinder package as computeDoubletDensity).
Figure 3 compares the performance of scDblFinder to alternatives across the real benchmark datasets.
scDblFinder has the highest mean area under the precision-recall (PR) curve (Supplementary Figure 8), ranking first in a majority of datasets, and otherwise typically very close to the top.
In addition, scDblFinder runs at a fraction of the time required by the next best methods (Figure 3, left).
e <- readRDS("../benchmark/benchmark.results.rds")
res <- readRDS("../analyses/optims_direct.rds")
res <- res[res$processing=="direct rawFeatures",]
res <- aggregate(res[,c("AUPRC","AUROC","elapsed")], by=res[,"dataset",drop=FALSE], FUN=mean)
res$method <- "directDblClassification"
e <- rbind(e, res[,colnames(e)])
datmax <- sort(apply(reshape2::dcast(e, method~dataset, value.var="AUPRC")[,-1],
2,na.rm=TRUE,FUN=max))
metmax <- sort(apply(reshape2::dcast(e, dataset~method, value.var="AUPRC")[,-1],
2,na.rm=TRUE,FUN=median))
e$dataset <- factor(e$dataset, levels=names(datmax))
e$method <- factor(e$method, levels=names(metmax))
levels(e$method) <- gsub("bcds","scds::bcds",levels(e$method))
levels(e$method) <- gsub("cxds","scds::cxds",levels(e$method))
levels(e$method) <- gsub("hybrid","scds::hybrid",levels(e$method))
getranks <- function(x){
y <- rank(x)
y[is.na(x)] <- NA
y
}
tr <- reshape2::dcast(e, method~dataset, value.var="AUPRC")
row.names(tr) <- tr[,1]; tr <- tr[,-1]
tr2 <- apply(tr,2,FUN=getranks)
e$AUPRC.rank <- apply(e[,1:2], 1, FUN=function(x) tr2[as.character(x[2]),as.character(x[1])])
e$point.colour <- viridisLite::viridis(100)[round(100*e$AUPRC)]
e$border.colour <- ifelse(e$AUPRC.rank>=(length(metmax)-0.5), "black", NA)
# e$border.colour <- ifelse(e$AUPRC.rank>=(length(metmax)-1.5),
# ifelse(e$AUPRC.rank>=(length(metmax)-0.5), "red", "black"), NA)
e$rounded <- round(e$AUPRC,2)
e$text.colour <- ifelse(e$rounded>=0.5,"black","white")
e$text.colour[e$AUPRC.rank==1] <- NA
nmeth <- length(unique(e$method))
scdbl.methods <- c("scDblFinder.clusters","scDblFinder.random","directDblClassification","computeDoubletDensity")
p1 <- ggplot(e, aes(dataset, method)) + geom_point(aes(colour=point.colour, size=AUPRC.rank^2)) +
scale_size(range=c(4,11), breaks=c(1, (nmeth/2)^2, nmeth^2), labels=c("worst","","best")) +
geom_text(aes(label=rounded, colour=text.colour), size=3) + geom_point(size=11, shape=21, aes(colour=border.colour)) +
scale_color_identity() + labs(size="AUPRC rank") +
theme(axis.text.x=element_text(angle=45, hjust=1),
axis.text.y=element_text(hjust=0.5, size=10.5, face="bold",
colour=ifelse(levels(e$method) %in% scdbl.methods,"black","grey30")),
axis.title.y=element_blank(), panel.grid=element_blank())
ag <- aggregate(e[,"elapsed"], by=e[,"method",drop=FALSE], FUN=mean)
p2 <- ggplot(ag, aes(method, elapsed)) + geom_col(width=0.75, fill="#00204DFF") +
#scale_y_continuous(trans = ggforce::trans_reverser('sqrt'), breaks=c(10,60,300)) +
geom_text(aes(y=200,label=paste0(round(elapsed),"s"),colour=elapsed>200), hjust=1) +
scale_colour_manual(values=c("TRUE"="white", "FALSE"="black"), guide=FALSE) +
scale_y_reverse() + coord_flip() + ylab("Mean running\ntime (s)") +
theme(axis.text.y=element_blank(), axis.title.y=element_blank(),
axis.text.x=element_text(angle=90), panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_blank(), panel.grid.major.x=element_line(colour="lightgrey"))
plot_grid(p2,p1,align="h", rel_widths=c(1,7))
Figure 3: Benchmark. Accuracy (area under the precision-recall curve) of doublet identification using alternative methods across 16 benchmark datasets. The size of the dots indicates the relative ranking for the dataset, and the numbers indicate the actual area under the (PR) curve. For each dataset, the top method is circled in black. Methods in bold are available through the scDblFinder package.
Several of the benchmark datasets have true doublets flagged by their mixing of single-nucleotide polymorphisms from multiple individuals (Kang et al. 2018). In most of these cases, however, the doublets also include inter-individual homotypic doublets (in the sense of being combinations of cells of the same type from different individuals), which are difficult to detect from gene expression (Figure 4A). In addition, they miss heterotypic doublets resulting from the combination of different cell types from the same individual. Indeed, datasets where there is a full correspondence between cell type and individual (such as the human-mouse mixtures, e.g. hm-6k and hm-12k) typically have a much higher area under the ROC and PR curves (Figure 3). It is therefore likely that the accuracy reported in the benchmark underestimates the actual accuracy in detecting heterotypic doublets. Based on the frequency of the different individuals and cell types in a dataset, it is possible to infer the expected rate of inter-individual homotypic doublets and within-individual heterotypic doublets. This, in turn, allows us to adjust the measured true positive rate (TPR) and false discovery rate (FDR) and to get a better picture of our ability to detect heterotypic doublets. Figure 4B shows such an analysis for a complex dataset from Kang et al. (2018). The inflection point of the PR curve roughly coincides with the expected proportion of heterotypic doublets among those flagged as true doublets.
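The expected fraction of doublets combining two cells of the same class (the same cell type, for homotypic doublets, or the same individual, for intra-individual doublets) can be computed from the class frequencies. A minimal version of this calculation, in the spirit of the `propHomotypic` function used below but not its exact code, is:

```r
# Probability that a random doublet combines two cells of the same class,
# given the per-cell class labels: sum of squared class frequencies.
prop_same_class <- function(labels) {
  p <- table(labels) / length(labels)
  sum(p^2)
}
```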
e <- readRDS("data/GSM2560248_noAmbiguous.processed.CD.rds")
# proportion homotypic doublets: these will be called as false negatives
prop.homotypic <- propHomotypic(e$scDblFinder.cluster)
# proportion within/intra-individual doublets: these will be called as false positives
prop.intraind <- propHomotypic(e$ind)
w <- which(e$scDblFinder.score>0.4220820)
x <- as.integer(e$multiplets[w]=="doublet")[order(e$scDblFinder.score[w])]
d <- data.frame(x=sum(!x)/length(x),
y=sum(x)/sum(e$multiplets=="doublet"))
p2 <- plotROCs(list(score=e$scDblFinder.score), e$multiplets=="doublet", fdr=TRUE,
prop.wrong.neg=prop.intraind, prop.wrong.pos=prop.homotypic,
showLegend=FALSE) + scale_color_manual(values=c("score"="darkviolet")) +
geom_point(data=d, aes(x,y), size=3.5)
plot_grid(dblTypesScheme(),
p2 + theme(legend.position="none"),
scale=0.95, labels="AUTO")
Figure 4: Doublet types and real accuracy of heterotypic doublet identification. A: Schematic (toy data) representing the different types of doublets. Within-genotype heterotypic doublets will wrongly be labeled as false positives, and inter-genotype homotypic doublets will be labeled as false negatives. B: Adjusted PR curve for an example sample (GSM2560248). The two shaded areas represent the expected proportion of within-genotype heterotypic doublets (i.e. wrongly labeled as singlets in the truth) and inter-genotype homotypic doublets, respectively.
Adjusting for both types of error in the truth, the area under the PR curve is considerably better (0.82 instead of 0.64), and at the automatic threshold we estimate that 87% of heterotypic doublets can be identified with a real FDR of 32% (a similar analysis for a different sample is shown in Supplementary Figure 9).
Most doublet detection methods provide a ‘doublet score’ that is higher in doublets than in singlets,
and users are left to decide on a threshold above which cells will be excluded as doublets.
Because scDblFinder’s scores come from a classifier, they can directly be interpreted as a probability.
Nevertheless, a threshold needs to be set, and it should ideally be placed at the inflection point (assuming there is one) of the ROC or PR curve, so that most doublets and not too many singlets are excluded.
While these curves are typically not available in practice, we found that in most cases the scDblFinder scores drop sharply from high to low very close to the inflection point (Figure 5).
One possibility is therefore to directly use a fixed probability threshold to call doublets.
In some cases, however, there is a more gradual change in score (e.g. nuc-MULTI dataset), making it more difficult to establish a threshold in a non-arbitrary fashion.
Building on the fairly tight relationship (especially in 10x-based datasets) between the number of cells captured and the rate of doublets generated (Kang et al. 2018), another approach consists in setting the threshold based on the number of doublets (or heterotypic doublets) one expects to find in the data.
scDblFinder includes a thresholding method that combines both rationales, and attempts to minimize both the proportion of artificial doublets being misclassified and the deviation from the expected doublet rate (see Methods and Supplementary Figure 1).
The identified thresholds are shown in Figure 5A-B, and compared to thresholds based on the expected doublet rate in Figure 5C.
In general, scDblFinder thresholds are closer to the inflection point.
ds <- readRDS("data/benchmark_datasets_called.CD2.rds")
ds <- lapply(ds, FUN=function(x){
ndb <- nrow(x)*((nrow(x)*0.01)/1000)
ndb <- round(ndb * (1-propHomotypic(x$scDblFinder.cluster)))
th <- sort(x$scDblFinder.score, decreasing=TRUE)[ndb]
x$called.dbrOnly <- factor(1+(x$scDblFinder.score>=th),1:2,c("singlet","doublet"))
x
})
getRocs <- function(ds, score="scDblFinder.score", class="scDblFinder.class", merge=TRUE){
ret <- lapply(ds, FUN=function(x){
d <- data.frame(truth=as.integer(x$truth=="doublet"),
score=x[[score]],
called=x[[class]]=="doublet")
d <- d[!is.na(d$truth),]
d <- d[order(d$score, decreasing=TRUE),]
d$FPR=cumsum(!d$truth)/sum(!d$truth)
d$FDR=cumsum(!d$truth)/seq_along(d$truth)
d$TPR=cumsum(d$truth)/sum(d$truth)
d
})
if(merge) ret <- dplyr::bind_rows(ret, .id="Dataset")
ret
}
plotROCscores <- function(rocs, legend=FALSE, addRandom=TRUE){
w <- which(!rocs$called & !duplicated(rocs[,c("Dataset","called")]))
rot <- rocs[w-1,]
p <- ggplot(rocs, aes(FPR, TPR, colour=score)) + scale_x_sqrt() +
geom_line(colour="grey", aes(group=Dataset)) + geom_point(size=0.8) +
geom_point(data=rot, size=1, colour="black") + geom_point(data=rot, shape=9, size=5) +
scale_colour_viridis_c(direction=-1) +
guides(colour=guide_colourbar(title.position="top", barwidth=8, title.hjust=0.5)) +
labs(colour="scDblFinder score")
if(legend){
p <- p + theme(legend.position="bottom")
}else{
p <- p + theme(legend.position="none")
}
if(addRandom) p <- p +
geom_line(data=data.frame(FPR=(0:100)/100, TPR=(0:100)/100),
colour="darkgrey", linetype="dashed")
p
}
rocs <- getRocs(ds)
w <- which(!rocs$called & !duplicated(rocs[,c("Dataset","called")]))
rot <- rocs[w-1,]
cols <- pipeComp::getQualitativePalette(length(ds))
names(cols) <- names(ds)
p1 <- ggplot(rocs, aes(FPR, TPR, colour=Dataset)) + geom_line(size=1.3) +
geom_point(data=rot, size=4, colour="black") + geom_point(data=rot, size=3) +
scale_x_sqrt() + scale_colour_manual(values=cols) +
guides(colour=guide_legend(title.position="top", title.hjust=0.5, ncol=3)) +
theme(legend.position="bottom")
p2 <- plotROCscores(rocs, TRUE)
p3 <- plotROCscores(getRocs(ds, class="called.dbrOnly"))
plot_grid(
p1 + theme(legend.position="none"),
plot_grid(ggpubr::get_legend(p1), ggpubr::get_legend(p2), nrow=2),
p2 + theme(legend.position="none") + ggtitle("Using scDblFinder thresholds"),
p3 + ggtitle("Using expected # of heterotypic doublets"),
labels=c("A",NA,"B","C"), nrow=2)
Figure 5: Thresholding. ROC curves (with square-root transformation on the x axis) of the different benchmark datasets. In B-C, the colors indicate the scDblFinder doublet scores, and the crosses indicate the thresholds established through the thresholding method (B) or by taking the expected number of heterotypic doublets (C).
Multiple samples are often profiled and analyzed together, with the very common risk of batch effects (either technical or biological) across samples (Lütge et al. 2021). Therefore, while the cells from all samples might in principle provide more information for doublet detection than a single sample can afford on its own, this must be weighed against the risk of bias due to technical differences. To investigate this, we implemented different multi-sample approaches and tested them on two real multi-sample datasets with demuxlet-based true doublets, as well as a sub-sampling of them (Supplementary Figure 10).
The different multi-sample strategies had only a minor impact on the accuracy of the identification. Based on these results, the best overall strategy appears to be to process all samples as if they were one; however, in our experience this can lead to biases against some samples when there are very large variations (e.g. in number of cells or coverage) across samples (not shown). This approach also greatly increases running time. In contrast, running the samples fully separately is computationally highly efficient, and is often equally accurate.
We next investigated whether scDblFinder could be applied to other types of single-cell data prone to doublets, such as single-cell ATACseq.
To evaluate this, we used the mixture of 10 cell lines from Granja et al. (2021).
With default parameters, scDblFinder performed very poorly (Supplementary Figure 11).
This is chiefly because scDblFinder follows the common scRNAseq strategy of selecting an informative subset of the features, while ATACseq reads are typically sparsely distributed across the genome.
However, working with all features (i.e. peaks) is computationally very expensive.
An alternative to both approaches is to begin by reducing the size of the dataset by aggregating correlated features into a relatively small set of meta-features, thereby using information from all features.
These aggregated features can then directly be used as the space in which to calculate distances.
This method yielded equal or better performance than specialized single-cell ATACseq software (Supplementary Figure 11).
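In the package, this aggregation strategy is enabled through dedicated arguments (names as in the package documentation at the time of writing; they may evolve across versions):

```r
library(scDblFinder)
# sce: a SingleCellExperiment with a peak x cell count matrix
sce <- scDblFinder(sce, aggregateFeatures = TRUE, nfeatures = 25,
                   processing = "normFeatures")
```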
When artificial doublets are generated between clusters, we can keep track of the clusters composing them, and we reasoned that this information could be used to infer the clusters composing real doublets (henceforth referred to as ‘doublet origin’). Using a simulation as well as the aforementioned real dataset with doublets of known origins (mixture of 5 cell lines from Tian et al. (2018)), we assessed the accuracy of doublet origin prediction based on the nearest artificial doublets in the kNN. These predictions proved inaccurate, both in real and simulated data (Supplementary Figure 12A-B). Even training a classifier directly on this problem failed (Supplementary Figure 12C-D). The problem appears to be that, owing to the very large variations in library sizes (and related variations in the relative contributions of the composing cells – see Figure 1B), doublets often contain a large fraction of reads from one cell type and, conversely, a small fraction from the other. As a consequence, we can typically call at least one of the two originating cell types, but seldom both. In the real dataset, at least one of the two originating cell types is correctly identified in 73% of doublets (random expectation: 36%), but both are correct in only 20% of cases.
While the identification of doublet origins remains a challenge, for the sake of completeness we nevertheless developed strategies to investigate whether certain doublet types were found more often than expected. Such enrichment could, for instance, indicate cell-to-cell interactions. We defined two forms of doublet enrichment (Figure 6A-B), and specified models to test each possibility: i) enrichment in doublets formed by a specific combination of celltypes, or ii) enrichment in doublets involving a given cell type, denoted ‘sticky.’
cln <- c(a=500,b=100,c=300,d=500)
n <- probs <- (cln/sum(cln)) %*% t(cln/sum(cln))
row.names(probs) <- row.names(n) <- names(cln)
probs[lower.tri(n)] <- n[lower.tri(n)] <- NA
diag(probs) <- diag(n) <- NA
probs <- probs/sum(probs,na.rm=TRUE)
set.seed(123)
probs1 <- probs2 <- probs
probs1[,4] <- probs1[,4]*2.5
probs2[2,4] <- probs2[2,4]*2.5
n1 <- n2 <- n
n1[upper.tri(n)] <- rpois(6,lambda=1500*probs1[upper.tri(n)])
n2[upper.tri(n)] <- rpois(6,lambda=1500*probs2[upper.tri(n)])
#n[lower.tri(n)] = t(n)[lower.tri(n)]
enr1 <- ( (n1/sum(n1,na.rm=TRUE))/probs )
enr2 <- ( (n2/sum(n2,na.rm=TRUE))/probs )
pdh <- function(x, column_title="Cell types", column_title_side="top", ...,
col=circlize::colorRamp2(c(0,1,2.5), c("blue","lightgrey","red"))){
Heatmap(x[-4,-1], col=col, cluster_rows=FALSE, cluster_columns=FALSE, na_col="white",
column_names_side="top", column_names_rot=90, row_title="Cell types",
column_title=column_title, column_title_side=column_title_side,
width=unit(2.5,"cm"), height=unit(2.5,"cm"), ...)
}
p1 <- plot_grid(
plot_grid(
ggplot(data.frame(celltype=names(cln), proportion=cln/sum(cln)), aes(celltype, proportion)) +
geom_col() + theme_cowplot() + xlab("Cell types") + ggtitle(" "),
grid.grabExpr(draw(pdh(probs, name="% of\ndbls", col=viridis::viridis(100)))),
scale=c(0.8,1)
),
NULL,
grid.grabExpr(draw((pdh(enr1, column_title="'Sticky'\ncell type", show_heatmap_legend=FALSE) +
pdh(enr2, name="fold-\nenrichment", column_title="Enriched\ncombination")),
column_title="Cell types", column_title_side="bottom")),
labels=c(" A: Expected doublet proportions", ""," B: Enrichment scenarios"),
nrow=1, hjust=0, rel_widths=c(6,1,5)
)
The ‘stickiness’ of each cluster (as a proxy for cell types) can be evaluated by fitting a single generalized linear model on the observed abundance of doublets of each origin (see Methods). We tested the performance of this test under different underlying distributions using simulated doublet counts: the number of doublets of each type was generated from random expectation, with or without added stickiness (factors of 1 to 3 on the probability), using negative binomial distributions with different over-dispersion parameters (Figure 6C and Supplementary Figure 13). The quasi-binomial model showed the best performance, followed by the negative binomial, but in all cases the p-values were not well calibrated, and many false positives were reported at a nominal FDR<0.05. This was robust across different over-dispersion values (Supplementary Figure 13).
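One plausible formulation of such a test (a self-contained toy version, not the package's exact model) fits a log-linear model in which each cluster contributes an indicator covariate whenever it participates in a doublet pair, with the random expectation entering as an offset:

```r
set.seed(42)
cl.freq <- c(a=0.25, b=0.15, c=0.2, d=0.1, e=0.3)   # cluster frequencies
pairs <- t(combn(names(cl.freq), 2))                # possible doublet origins
expected <- 2 * cl.freq[pairs[,1]] * cl.freq[pairs[,2]]
# simulate doublet counts with cluster 'e' twice as 'sticky' as the others
lam <- 2000 * expected * ifelse(pairs[,1]=="e" | pairs[,2]=="e", 2, 1)
counts <- rpois(nrow(pairs), lam)
# design: one indicator column per cluster, 1 if it participates in the pair
X <- sapply(names(cl.freq), function(k) as.integer(pairs[,1]==k | pairs[,2]==k))
fit <- glm(counts ~ 0 + X, family = poisson, offset = log(2000 * expected))
coef(fit)  # the coefficient for the sticky cluster should stand out
```

The intercept is dropped because every pair involves exactly two clusters, which would otherwise make the design collinear.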
load("../analyses/enrichment_results.RData")
scores <- stick.scores
cols <- setNames(scales::hue_pal()(length(scores)), names(scores))
p2 <- plotROCs(lapply(scores, FUN=function(x) 1-x$FDR), scores[[2]]$truth,
               th=0.95, size=4, fdr=TRUE) + scale_colour_manual(values=cols)
leg <- ggpubr::get_legend(p2)
scores <- comb.scores
p3 <- plotROCs(lapply(scores, FUN=function(x) 1-x$FDR), scores[[2]]$truth,
               th=0.95, size=4, fdr=TRUE) + theme(legend.position="none")
plot_grid(p1,
          plot_grid(p2 + ggtitle("Cluster stickiness") + theme(legend.position="none"),
                    p3 + ggtitle("Specific combinations"), leg,
                    nrow=1, rel_widths=c(3,3,1), scale=0.95, labels=c("C","D","")),
          nrow=2, rel_heights=c(2.1,3))
Figure 6: Doublet enrichment analysis. A-B: Doublet enrichment in a toy example. A: Proportion of different doublet types from random expectations based on the cell type abundances. B: The fold-enrichment over this expectation in two different doublet enrichment scenarios. C-D: Performance of the cluster stickiness tests (C) and tests for enrichment of specific combinations (D) using different underlying distributions.
We next sought to establish a test for the enrichment of specific combinations. Here, we simply computed the probability of the observed counts for each combination under different models (see Methods). We again tested this approach with different underlying distributions, on simulations with varying over-dispersion (Figure 6D). The negative binomial performed best; however, all variants suffered from a high false discovery rate.
The characterization of real doublets suggests a multi-layered variation in mRNA capture efficiency, and calls for a varied approach to artificial doublet generation. The scDblFinder package includes a set of efficient methods for doublet detection. In particular, the main scDblFinder approach combines multiple doublet generation strategies and integrates insights from previous methods into a comprehensive doublet detection framework that provides robustly accurate detection across a number of benchmark datasets, with considerably greater speed and scalability than the best alternatives. Even in complex datasets, most heterotypic doublets can be accurately identified. The doublet scores given by scDblFinder can be directly interpreted as probabilities, and the method additionally includes a trade-off thresholding procedure combining doublet rate expectations with classification optimization, further facilitating its usage. Finally, we demonstrate that, with slight changes in parameters, the approach is also applicable to other data types such as single-nucleus ATAC-seq.
scDblFinder additionally provides utilities for identifying the origins of doublets (in terms of the cell types composing them) and for testing different forms of doublet enrichment. At present, however, the value of such tests is limited by the difficulty of accurately identifying doublet origins; further research will be needed to assess to what extent this can be improved.
In conclusion, we believe that scDblFinder, with its flexibility, accuracy and scalability, represents a key resource for doublet detection in high-throughput single-cell sequencing data.
Analyses were run on R version 4.0.3 (2020-10-10), Bioconductor version 3.12, using the following versions of the doublet identification packages: DoubletCollection 1.1.0, DoubletFinder 2.0.3, scDblFinder 1.7.1, scds 1.6.0, Scrublet 0.2.3. Packages were launched through DoubletCollection using default parameters.
All code to reproduce the analyses and the figures is available at https://github.com/plger/scDblFinder_paper
An irlba-based singular value decomposition is first run using the scater package, and a kNN network is generated using the Annoy approximation implemented in BiocNeighbors.
Louvain clustering is then used on the graph.
If the dataset is sufficiently large (>1000 cells), a first rapid k-means clustering (using the mbkmeans package) is used to generate a large number of meta-cells, which are then clustered using the graph-based approach, propagating clusters back to the cells themselves.
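This two-step scheme can be illustrated as follows, using base-R stand-ins for the actual components (hierarchical clustering replaces the graph-based Louvain step, and `kmeans` replaces mbkmeans, purely for the sake of a self-contained sketch):

```r
# Two-step clustering sketch for large datasets: fast k-means into
# meta-cells, clustering of the meta-cell centroids (hierarchical
# clustering used here as a simple stand-in for the graph-based step),
# then propagation of the labels back to the individual cells
set.seed(0)
pcs <- matrix(rnorm(2000 * 10), nrow=2000)     # toy PCA coordinates
km  <- kmeans(pcs, centers=100, iter.max=50)   # meta-cells
hc  <- cutree(hclust(dist(km$centers)), k=8)   # cluster the meta-cells
cl  <- hc[km$cluster]                          # propagate back to the cells
```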
Unless manually given, the expected number of doublets (\(e\)) is estimated as \(e = n^2 \cdot 10^{-5}\) (where \(n\) is the number of cells captured).
This is then restricted to heterotypic doublets using random expectation from cluster sizes or, if not using the cluster-based approach, using the proportion of artificial doublets misidentified.
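For instance, under this formula (a sketch; the restriction to heterotypic doublets is omitted here):

```r
# Expected doublet count e = n^2 * 1e-5, i.e. roughly 1% doublets
# per 1000 cells captured
expected_doublets <- function(n) n^2 * 1e-5
expected_doublets(5000)   # 250 doublets, i.e. a 5% doublet rate
```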
The doublet rate is accompanied by an uncertainty interval (dbr.sd parameter), and the deviation from the expected doublet number for threshold \(t\) is then calculated as
\[ deviation_t = \begin{cases} 0 & \text{if } (o_t \geq e_{low} \land o_t \leq e_{high}) \\ 2 \cdot \frac{ \min(|o_t-e_{low}|, |o_t-e_{high}|) }{e_{low} + e_{high}} & \text{otherwise} \end{cases} \]
where \(o_t\) represents the number of real cells classified as doublets at threshold \(t\), and \(e_{low}\) and \(e_{high}\) represent, respectively, the lower and higher bounds of the expected number of heterotypic doublets in the dataset (based on the given or estimated doublet rate and the dbr.sd parameter).
The cost function being minimized is then simply given by \(cost_t = FNR_t + FPR_t + deviation_t^2\), where the false negative rate (\(FNR_t\)) represents the proportion of artificial doublets misclassified as singlets at threshold \(t\), and the false positive rate (\(FPR_t\)) represents the proportion of real cells classified as doublets.
Since this is performed in an iterative fashion, the \(FPR\) is calculated ignoring cells which were called as doublets in the previous round.
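The deviation and cost above can be written directly as follows (a re-implementation sketch, not the package's internal code):

```r
# Deviation from the expected doublet count range at a given threshold
deviation <- function(o, e_low, e_high){
  if(o >= e_low && o <= e_high) return(0)
  2 * min(abs(o - e_low), abs(o - e_high)) / (e_low + e_high)
}
# Cost combining the misclassification rates and the squared deviation
cost <- function(fnr, fpr, o, e_low, e_high)
  fnr + fpr + deviation(o, e_low, e_high)^2
```

For example, with \(e_{low}=5\) and \(e_{high}=15\), calling 20 real cells as doublets yields a deviation of \(2 \cdot 5/20 = 0.5\), which enters the cost squared.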
Cluster 'stickiness' can be evaluated by fitting a single generalized linear model on the observed abundance of doublets of each origin, in the following way:
\[ \log(observed_i+0.1) = \log(e_i) + \beta_z \cdot \log(difficulty_i) + \beta_a a_i + \beta_b b_i + \beta_c c_i + ... +\epsilon_i , \] where \(observed_i\) and \(e_i\) represent the numbers of doublets formed by specific combination \(i\) of clusters which are respectively observed or expected from random combinations, and \(a_i\), \(b_i\) and \(c_i\) (etc) indicate whether or not (0/1) the doublet involves each cluster.
Because some doublets are easier to identify than others, some deviation from their expected abundance is typically observed.
For this reason, a \(\text{difficulty}_i\) term is optionally included, indicating the difficulty of identifying doublets of origin \(i\), estimated from the misclassification of scDblFinder's artificial doublets of that origin (by default, the term is included if there are at least 7 clusters).
A \(\beta_a\) significantly different from zero then indicates that cluster \(a\) forms more or fewer doublets than expected; if positive, it indicates cluster 'stickiness'.
For the (quasi-)binomial distributions, logit was used instead of log transformation, and the mean of observed and expected counts was used as observational weights.
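A minimal sketch of such a fit on simulated counts is given below; Poisson counts and a quasi-Poisson fit are used for simplicity, the difficulty term is omitted, and the variable names are illustrative:

```r
# Simulate doublet counts for all pairs of k clusters, with cluster 1
# 'sticky', then fit the stickiness model (no intercept, log(e) offset)
set.seed(1)
k <- 6
combs <- t(combn(k, 2))
e <- rep(20, nrow(combs))                  # expected counts (toy values)
X <- sapply(seq_len(k), function(j) as.integer(combs[,1]==j | combs[,2]==j))
colnames(X) <- paste0("cl", seq_len(k))    # 0/1 cluster-membership indicators
observed <- rpois(nrow(combs), e * ifelse(X[,1]==1, 3, 1))
d <- data.frame(observed=observed, X)
fit <- glm(reformulate(colnames(X), response="observed", intercept=FALSE),
           offset=log(e), family=quasipoisson, data=d)
coef(fit)   # the cl1 coefficient should be close to log(3)
```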
To account for the different identification difficulty across doublet types, we first fit the following global negative binomial model:
\[ \log(observed_i) = \alpha + \log(e_i) + \beta \cdot \log(difficulty_i), \]
The fitted values are then considered the expected abundance, and the probability of each doublet type's count given this expectation is calculated using either underlying distribution (for the negative binomial, the global over-dispersion parameter estimated in the first step is used).
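For the negative binomial variant, for instance, the per-combination probability can be computed as follows (a sketch with toy values; `size` stands for the inverse over-dispersion from the global fit):

```r
# Upper-tail probability of each observed doublet count given its
# fitted expectation, under a negative binomial (toy values)
observed <- c(25, 8, 60)
expected <- c(20, 10, 22)      # fitted values from the global model
size <- 5                      # inverse over-dispersion from the global fit
p <- pnbinom(observed - 1, mu=expected, size=size, lower.tail=FALSE)
p.adj <- p.adjust(p, method="fdr")
```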
The direct classification approach is implemented in the directDblClassification function of the package. It uses the same doublet generation, thresholding and iterative learning procedures as scDblFinder, but trains directly on the normalized expression matrix of real and artificial cells instead of kNN-based features. The hyperparameters were the same except for the maximum tree depth, which was increased to 6 to account for the increased complexity of the predictors.
For feature aggregation, scDblFinder first normalizes the counts using the Term Frequency - Inverse Document Frequency (TF-IDF) normalization, as implemented in Stuart et al. (2019). PCA is then performed and the features are clustered into the desired number of meta-features using mini-batch k-means (Hicks et al. 2021) or, if not available, simple k-means. The counts are then summed per meta-feature.
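The TF-IDF step can be sketched as follows, following the general formulation of Stuart et al. (2019); the exact scaling used in the package may differ:

```r
# TF-IDF normalization of a peaks-by-cells count matrix:
# per-cell term frequency times per-peak inverse document frequency
tfidf <- function(counts, scale.factor=1e4){
  tf  <- t(t(counts) / colSums(counts))      # term frequency per cell
  idf <- ncol(counts) / rowSums(counts > 0)  # inverse document frequency
  log1p(tf * idf * scale.factor)             # scaled log transform
}
```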
No competing interests were disclosed.
This work was supported by the Swiss National Science Foundation (grant number 310030_175841). MDR acknowledges support from the University Research Priority Program Evolution in Action at the University of Zurich.
We thank Nan Miles Xi and Jingyi Jessica Li for help with their benchmark datasets; Jeffrey Granja for support with ArchR and its attached dataset; and the Robinson group and users for feedback.